options(scipen = 999)
library(purrr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
set.seed(123)

Goal

Here we will perform a multivariate multiple regression analysis to predict a player’s topline basketball in-game statistics using several variables such their salary, age, team, and certain other in-game statistics. This analysis is motivated by the question: what is the relationship between a player’s salary and their in-game performance?

In this analysis, we investigate the inverse of the typical question: instead of asking how a player’s in-game performance predicts their salary, we ask how a player’s salary predicts their in-game performance. This can help us determine how related the salary is to performance, and may lead to insights in if a player is “overpaid” or “underpaid” relative to their performance.

We must use multivariate multiple regression because we have multiple response variables (e.g. points, assists, and rebounds) and multiple predictor variables (e.g. the salary, age, team, and other in-game statistics).

Data

See the data prep markdown for more details.

Exploratory Data Analysis

Multivariate Normality

We recall from lecture that one of the assumptions of linear regression is that the data is multivariate normal. This distribution is characterized by the following properties:

  1. The marginal distribution of each variable is normal
  2. The joint distribution of pairs of variables are normal

These are very helpful because they allow us to check for multivariate normality by checking for univariate normality, rather than considering the challenging task of checking for multivariate normality directly.

We now consider the univeriate cases. We plot the distribution of each numeric column and the salary column in order to visually inspect the distribution. Then, we use the Shapiro-Wilk test to test for normality and check if this matches our visual inspection.

# read the data from csv
per_season_data <- readr::read_csv("data/per_season_data.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   player = col_character(),
##   Season = col_character(),
##   Tm = col_character(),
##   Team = col_character(),
##   Pos = col_character(),
##   Colleges = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
dim(per_season_data)
## [1] 866  31
head(per_season_data)
## # A tibble: 6 x 31
##   player  Season   Age Tm    Team   Pos      Ht    Wt Colleges     G    GS    MP
##   <chr>   <chr>  <dbl> <chr> <chr>  <chr> <dbl> <dbl> <chr>    <dbl> <dbl> <dbl>
## 1 abdela… 1990-…  23.9 POR   Portl… F-C      82   240 Duke      7928    53  1020
## 2 abdela… 1991-…  23.9 POR   Portl… F-C      82   240 Duke      7928    53  1020
## 3 abdela… 1992-…  23.9 POR   Bosto… F-C      82   240 Duke      7928    53  1020
## 4 abdela… 1993-…  23.9 POR   Bosto… F-C      82   240 Duke      7928    53  1020
## 5 abdela… 1994-…  23.9 POR   Sacra… F-C      82   240 Duke      7928    53  1020
## 6 abdela… Career  23.9 POR   (may … F-C      82   240 Duke      7928    53  1020
## # … with 19 more variables: FG <dbl>, FGA <dbl>, FG% <dbl>, 3P <dbl>,
## #   3PA <dbl>, 3P% <dbl>, FT <dbl>, FTA <dbl>, FT% <dbl>, ORB <dbl>, DRB <dbl>,
## #   TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>,
## #   Salary <dbl>
# handle NAs for now
per_season_data <- per_season_data %>% 
    mutate(
        Team = ifelse(is.na(Team), "Other", Team),
        Season = ifelse(is.na(Season), "Other", Season),
        Salary = ifelse(is.na(Salary), mean(Salary), Salary)
    )
# create a list of numeric columns
num_cols <- per_season_data %>% 
    select(where(is.numeric)) %>% 
    names()

# plot the distribution of each numeric column
for (col in num_cols) {
    hist(per_season_data[[col]], main = col)
}

# plot the distribution of the salary
hist(per_season_data$Salary, main = "Salary")

We also present Q-Q plots for the variables, which demonstrats the columns that aren’t normal.

for (col in num_cols) {
    qqnorm(per_season_data[[col]], main = col)
    qqline(per_season_data[[col]])
    # title
    title(main = col)
}

We attempted to perform the Shapiro Wilk test, but it was not possible to perform the test on the size of our data - there is a limit of 5000 observations. So, we will perform the test on a sample of the data, but we will not use this test to make any final conclusions.

# shapiro wilk test
for (col in num_cols) {
    print(col)
    # sample the data of size 5000
    x = sample(per_season_data[[col]], 5000)
    print(shapiro.test(x))
}

We see clearly that many of the variables are not normally distributed and will need to be transformed. Furthermore, the salary distribution features a long tail, with substantial outliers.

Transformations

Depending on the skewness of the data, we can consider polynomial transformations and log transformations. We will consider transformations for the following variables which will be used in the regression analysis:

After investigation, we see that the following transformations are suitable:

per_season_data <- per_season_data %>% 
    mutate(
        PTS = PTS^(1/2),
        AST = AST^(1/2),
        TRB = TRB^(1/2),
        Salary = Salary^(1/10),
        #Age no transform needed
        #Ht no transform needed
        Wt = Wt^2,
        GS = GS^(1/3),
        G = G^(1/2)
        #MP no transform needed
    )

Multivariate Multiple Regression Regression

Now that the data is more suitable for our regression task, we can perform the regression. First, we discuss the characteristics of multivariate multiple regression.

Most notably, multivariate multiple regression is a generalization of multiple regression. In multiple regression, we have a single response variable and multiple predictor variables. In multivariate multiple regression, we have multiple response variables and multiple predictor variables.

To start, we create the following model:

\[Y = \Beta Z\]

# convert some variabels to factor for R's lm function
per_season_data <- per_season_data %>% 
    mutate(
        Season = as.factor(Season),
        Team = as.factor(Team),
        Pos = as.factor(Pos)
    )
mlm1 <- lm(
    cbind(PTS, AST, TRB) ~ Salary + Age + Ht + Wt + Pos + GS + G + MP,
    data = per_season_data
)
summary(mlm1)
## Response PTS :
## 
## Call:
## lm(formula = PTS ~ Salary + Age + Ht + Wt + Pos + GS + G + MP, 
##     data = per_season_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14638 -0.18852 -0.00722  0.18709  0.76131 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)  0.097799276  0.559859999   0.175             0.861370    
## Salary       0.035454466  0.015812087   2.242             0.025211 *  
## Age         -0.038773455  0.005459598  -7.102    0.000000000002661 ***
## Ht          -0.006335313  0.006439966  -0.984             0.325527    
## Wt           0.000006346  0.000001748   3.631             0.000300 ***
## PosC-F       0.386460141  0.049725053   7.772    0.000000000000023 ***
## PosF         0.106641824  0.046375383   2.300             0.021724 *  
## PosF-C       0.148916003  0.045813807   3.250             0.001199 ** 
## PosF-G       0.314883217  0.054999630   5.725    0.000000014469906 ***
## PosG         0.229710890  0.073478460   3.126             0.001833 ** 
## PosG-F       0.115774720  0.061644164   1.878             0.060719 .  
## GS          -0.059492747  0.016010405  -3.716             0.000216 ***
## G            0.004485254  0.000629516   7.125    0.000000000002273 ***
## MP           0.002301965  0.000075024  30.683 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2873 on 824 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.895,  Adjusted R-squared:  0.8933 
## F-statistic: 540.2 on 13 and 824 DF,  p-value: < 0.00000000000000022
## 
## 
## Response AST :
## 
## Call:
## lm(formula = AST ~ Salary + Age + Ht + Wt + Pos + GS + G + MP, 
##     data = per_season_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60364 -0.13881 -0.01604  0.14352  0.64564 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)  5.433722509  0.425988596  12.756 < 0.0000000000000002 ***
## Salary      -0.001347760  0.012031166  -0.112               0.9108    
## Age         -0.023261040  0.004154122  -5.600    0.000000029289078 ***
## Ht          -0.055841745  0.004900068 -11.396 < 0.0000000000000002 ***
## Wt          -0.000008881  0.000001330  -6.678    0.000000000044579 ***
## PosC-F       0.094221978  0.037835005   2.490               0.0130 *  
## PosF        -0.259636830  0.035286293  -7.358    0.000000000000452 ***
## PosF-C      -0.168620414  0.034858999  -4.837    0.000001571373509 ***
## PosF-G      -0.057699633  0.041848347  -1.379               0.1683    
## PosG        -0.121915579  0.055908595  -2.181               0.0295 *  
## PosG-F      -0.261859860  0.046904067  -5.583    0.000000032120049 ***
## GS           0.024231706  0.012182063   1.989               0.0470 *  
## G            0.000912071  0.000478988   1.904               0.0572 .  
## MP           0.000787670  0.000057084  13.798 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2186 on 824 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.8361, Adjusted R-squared:  0.8335 
## F-statistic: 323.3 on 13 and 824 DF,  p-value: < 0.00000000000000022
## 
## 
## Response TRB :
## 
## Call:
## lm(formula = TRB ~ Salary + Age + Ht + Wt + Pos + GS + G + MP, 
##     data = per_season_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67856 -0.16327 -0.00599  0.14571  0.56670 
## 
## Coefficients:
##                  Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept) -1.7834089684  0.4172853723  -4.274  0.00002145890238160 ***
## Salary       0.0239028111  0.0117853617   2.028               0.0429 *  
## Age         -0.0287838037  0.0040692502  -7.073  0.00000000000322845 ***
## Ht           0.0422456728  0.0047999566   8.801 < 0.0000000000000002 ***
## Wt          -0.0000009692  0.0000013028  -0.744               0.4571    
## PosC-F       0.0645121174  0.0370620109   1.741               0.0821 .  
## PosF        -0.2823501999  0.0345653715  -8.169  0.00000000000000117 ***
## PosF-C      -0.0804293081  0.0341468073  -2.355               0.0187 *  
## PosF-G      -0.3049418836  0.0409933576  -7.439  0.00000000000025535 ***
## PosG        -0.5793483637  0.0547663461 -10.579 < 0.0000000000000002 ***
## PosG-F      -0.4408658778  0.0459457864  -9.595 < 0.0000000000000002 ***
## GS           0.0254978069  0.0119331759   2.137               0.0329 *  
## G            0.0009863719  0.0004692023   2.102               0.0358 *  
## MP           0.0007324771  0.0000559180  13.099 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2141 on 824 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.8235, Adjusted R-squared:  0.8208 
## F-statistic: 295.8 on 13 and 824 DF,  p-value: < 0.00000000000000022

Interpretation

mlm1$coefficients
##                         PTS             AST             TRB
## (Intercept)  0.097799276044  5.433722508646 -1.783408968368
## Salary       0.035454465501 -0.001347760411  0.023902811122
## Age         -0.038773454576 -0.023261039752 -0.028783803703
## Ht          -0.006335313218 -0.055841745027  0.042245672768
## Wt           0.000006346443 -0.000008881267 -0.000000969235
## PosC-F       0.386460141094  0.094221977869  0.064512117403
## PosF         0.106641823562 -0.259636830004 -0.282350199945
## PosF-C       0.148916003387 -0.168620413563 -0.080429308144
## PosF-G       0.314883217453 -0.057699632982 -0.304941883580
## PosG         0.229710889778 -0.121915579031 -0.579348363663
## PosG-F       0.115774719513 -0.261859860325 -0.440865877794
## GS          -0.059492746679  0.024231706283  0.025497806949
## G            0.004485253829  0.000912070851  0.000986371888
## MP           0.002301964636  0.000787669714  0.000732477057

We will now consider each predictor variable and see how it affects the response variables.

  • Salary: We see salary has a signicant effect at the 0.05 level only on Points and Rebounds, but does not have a significant effect on Assists. This is suprising, as it suggests that some in-game stats may not be as related to salary as others. It may suggest that the assist-seeking players may not be as highly compensated for that ability. The model confirms that the relatonship between salary and in-game statistical performance is positive.
  • Age: This is a very significant variable and is negative for all three response variables. This suggests that as players become older, their statistical output goes down. This is not surprising.
  • Ht: This variable is not significant for points but it is for assists and rebounds. Height has a positive relationship with rebounds and a negative relationship with assists. This makes sense because certain basketball positions are determined by height and these positions lead to specific in-game roles which may lead to more assists or more rebounds, but all positions can score points. Taller players are more likely to be centers, who are more likely to get rebounds, and shorter players are more likely to be guards, who are more likely to get assists.
  • GS and G: These variables are similar, so we will group them together. They are highly significant for the points response variable, but not so much for the assists and rebounds, which is a little surprising. Furthermore, the GS variable has a negative coefficient with points - highly surprising. This relationship may require further investigation.
  • MP: This variable is significant and positive for all three response variables. This makes sense because players who play more minutes are more likely to score more points, get more assists, and get more rebounds.

Model Fit

Before we proceed further, let us consider the fit of the model. We plot the residuals vs fitted values for each of the three response variables and check the normality of the residuals.

plot(mlm1$fit[,1],mlm1$resid[,1])

plot(mlm1$fit[,2],mlm1$resid[,2])

plot(mlm1$fit[,3],mlm1$resid[,3])

qqnorm(mlm1$resid[,1])
qqline(mlm1$resid[,1])

qqnorm(mlm1$resid[,2])
qqline(mlm1$resid[,2])

qqnorm(mlm1$resid[,3])
qqline(mlm1$resid[,3])

We see that this initial model is pretty good, but perhaps we can do better. We can consider alternate subsets of variables to include in the model. And compare them using the AIC metric.

Model Selection

We can use several methods to help us choose the best model. It is preferable to include the least number of predictor variables possible, in order to have an explanable model, avoid overfitting, etc, but to balance this with the need to have a model that is sufficiently accurate.

For example, we may use the AIC, BIC, or adjusted R-squared to help us choose the best model. We calculate the AIC for our initial model.

# define AIC because the function doesnt support multiple respnses
# (resid sum of squares cross product matrix)/n
# AIC = n * log(det(Sigma_d)) - 2p * d
# where p is the number of parameters and d is the number of responses

n = nrow(per_season_data)
# mlm1 number of parameters
p = 8 
d = 3
aic = n * log(det(crossprod(mlm1$residuals))) - 2 * p * d
aic
## [1] 9760.221

We first consider the most simple model of only using Salary variable to predict the in-game statistics.

salary_mlm <- lm(
    cbind(PTS, AST, TRB) ~ Salary,
    data = per_season_data
)
summary(salary_mlm)
## Response PTS :
## 
## Call:
## lm(formula = PTS ~ Salary, data = per_season_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71942 -0.54134 -0.05293  0.56652  2.16099 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  0.55888    0.16274   3.434             0.000623 ***
## Salary       0.53940    0.03701  14.573 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8199 on 851 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.1997, Adjusted R-squared:  0.1988 
## F-statistic: 212.4 on 1 and 851 DF,  p-value: < 0.00000000000000022
## 
## 
## Response AST :
## 
## Call:
## lm(formula = AST ~ Salary, data = per_season_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23091 -0.41696 -0.06474  0.39725  1.48099 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)  0.45558    0.10444   4.362 0.0000144555437827 ***
## Salary       0.19357    0.02375   8.149 0.0000000000000013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5262 on 851 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.07239,    Adjusted R-squared:  0.0713 
## F-statistic: 66.41 on 1 and 851 DF,  p-value: 0.000000000000001304
## 
## 
## Response TRB :
## 
## Call:
## lm(formula = TRB ~ Salary, data = per_season_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17872 -0.29780 -0.02929  0.30152  1.51288 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  0.66224    0.09472   6.992      0.0000000000055 ***
## Salary       0.28135    0.02154  13.060 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4772 on 851 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.167,  Adjusted R-squared:  0.166 
## F-statistic: 170.6 on 1 and 851 DF,  p-value: < 0.00000000000000022
# aic
n = nrow(per_season_data)
p = 1
d = 3
aic_salary = n * log(det(crossprod(salary_mlm$residuals))) - 2 * p * d
aic_salary
## [1] 13572.94

We see that the AIC value is much higher for the salary-only model, so we will not consider this model further. We will consider one more simpler model, which only includes the Salary and demographic variables.

salary_demographic_mlm <- lm(
    cbind(PTS, AST, TRB) ~ Salary + Age + Ht + Wt,
    data = per_season_data
)

summary(salary_demographic_mlm)
## Response PTS :
## 
## Call:
## lm(formula = PTS ~ Salary + Age + Ht + Wt, data = per_season_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1210 -0.5361 -0.1170  0.5254  2.2587 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept) -0.484516484  0.953805609  -0.508              0.61160    
## Salary       0.533184613  0.035949642  14.831 < 0.0000000000000002 ***
## Age          0.067685732  0.010181618   6.648      0.0000000000532 ***
## Ht          -0.002649727  0.012998656  -0.204              0.83852    
## Wt          -0.000010486  0.000003887  -2.698              0.00712 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7829 on 848 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.2729, Adjusted R-squared:  0.2695 
## F-statistic: 79.57 on 4 and 848 DF,  p-value: < 0.00000000000000022
## 
## 
## Response AST :
## 
## Call:
## lm(formula = AST ~ Salary + Age + Ht + Wt, data = per_season_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19532 -0.25254 -0.03732  0.23958  1.39224 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)  3.961957922  0.472385780   8.387 < 0.0000000000000002 ***
## Salary       0.226291042  0.017804571  12.710 < 0.0000000000000002 ***
## Age          0.029963959  0.005042591   5.942    0.000000004103941 ***
## Ht          -0.047990884  0.006437769  -7.455    0.000000000000223 ***
## Wt          -0.000013754  0.000001925  -7.144    0.000000000001953 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3878 on 848 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4981, Adjusted R-squared:  0.4957 
## F-statistic: 210.4 on 4 and 848 DF,  p-value: < 0.00000000000000022
## 
## 
## Response TRB :
## 
## Call:
## lm(formula = TRB ~ Salary + Age + Ht + Wt, data = per_season_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2966 -0.2354 -0.0045  0.2348  1.0633 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept) -6.028651737  0.463750672 -13.000 < 0.0000000000000002 ***
## Salary       0.235154187  0.017479107  13.453 < 0.0000000000000002 ***
## Age          0.035885743  0.004950414   7.249    0.000000000000945 ***
## Ht           0.074546974  0.006320088  11.795 < 0.0000000000000002 ***
## Wt           0.000001713  0.000001890   0.906                0.365    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3807 on 848 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4719, Adjusted R-squared:  0.4694 
## F-statistic: 189.4 on 4 and 848 DF,  p-value: < 0.00000000000000022
n = nrow(per_season_data)
p = 4
d = 3
aic_demo = n * log(det(crossprod(salary_demographic_mlm$residuals))) - 2 * p * d
aic_demo
## [1] 12044.69

This model also gives a much higher AIC than the initial model, so we will not consider this model further. Indeed, it appears that the initial model is a well performing model and that including all of those variables is beneficial.